In [ ]:
%%HTML
<style>
.container { width:100% }
</style>

Iris Classification with Naive Bayes Using scikit-learn


In [ ]:
import pandas as pd
import numpy  as np

In [ ]:
IrisDF = pd.read_csv('iris.csv')
IrisDF.head()

We extract the column 'species', convert it into a set to remove duplicate entries, and turn the result back into a list.


In [ ]:
Species  = list(set(IrisDF['species']))
Species

We extract the feature names. This can be done conveniently by converting the DataFrame into a list, which yields the column names. However, we do not need the column 'species', since this is the dependent variable. Fortunately, this column is the last element in the list, so we can easily drop it.


In [ ]:
Features = list(IrisDF)[:-1]
Features

scikit-learn provides a naive Bayes classifier that assumes that the continuous features follow a Gaussian distribution.


In [ ]:
from sklearn.naive_bayes import GaussianNB

We extract the independent variables and store them in the design matrix X.


In [ ]:
X = IrisDF[Features]

We extract the dependent variable and store it in Y.


In [ ]:
Y = IrisDF['species']

We construct a naive Bayes classifier that assumes a normal distribution. This classifier assumes that $$ P(f=x | C) = \frac{1}{\sqrt{2\cdot\pi\;}\cdot \sigma_{f,C}} \cdot \exp\left(-\frac{\bigl(x-\mu_{f,C}\bigr)^2}{2 \cdot \sigma_{f,C}^2}\right). $$ Here $P(f=x | C)$ is the conditional probability density that the feature $f$ has the value $x$ given that $C$ is the species of the flower under investigation. $\mu_{f,C}$ is the mean value of the feature $f$ for the class $C$, while $\sigma_{f,C}^2$ is the variance of the feature $f$ for the class $C$.
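The Gaussian density above can be checked by hand: for a given feature and class, compute the class-conditional mean and variance and plug them into the formula. The sketch below does this for the first feature of the setosa class; it loads the iris data via `sklearn.datasets.load_iris` so that it is self-contained, whereas the notebook reads `iris.csv` instead.

```python
import numpy as np
from sklearn.datasets import load_iris

# Self-contained stand-in for the notebook's iris.csv.
iris = load_iris()
X, y = iris.data, iris.target

f = 0    # feature index (sepal length)
C = 0    # class index (setosa)
x = 5.0  # value at which we evaluate the density

# mu_{f,C} and sigma^2_{f,C}: mean and variance of feature f restricted to class C
mu  = X[y == C, f].mean()
var = X[y == C, f].var()

# Gaussian density P(f=x | C), term by term as in the formula above
density = np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
print(density)
```

Since the setosa sepal lengths cluster tightly around 5.0 cm, the density at this point is close to its maximum.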


In [ ]:
classifier = GaussianNB()

We train the classifier with our data.


In [ ]:
classifier.fit(X, Y)

We compare the predicted values of our classifier with the actual values and compute the accuracy.


In [ ]:
np.sum(classifier.predict(X) == Y) / len(Y)
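Note that this accuracy is measured on the same data the classifier was trained on, so it tends to be optimistic. A common remedy is to hold out part of the data for evaluation with `train_test_split`; the sketch below does this with `load_iris` (a stand-in for the notebook's `iris.csv`) and an arbitrary `random_state` of 42.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = load_iris()

# Hold out 30% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

clf = GaussianNB().fit(X_train, y_train)

# score() computes the accuracy on data the classifier has never seen.
accuracy = clf.score(X_test, y_test)
print(accuracy)
```

On the iris data the held-out accuracy is usually only slightly below the training accuracy, since the classes are well separated.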
